Project Description

Objective

Use INE mobility data to explore the value of CORINE land cover data when modeling human mobility in cities.

Questions

Can random forest regression with land cover variables improve predictions compared to a simple linear gravity model?

Given the large number of land cover variables (52, when including combined origin and destination data) and their uncertain relationship to human mobility, random forests may help incorporate this data into a model without making assumptions about interactions (e.g. between origin and destination variables) or transformations (e.g. log scale). Additionally, random forest models indicate the relative importance of each variable, which may be interesting in its own right.

Which model provides the best predictions for new cities (and why)?

Given that INE provides mobility data that covers nearly all residents of Spain, a model of mobility is more useful if it can provide accurate predictions for other cities. To test this, I model mobility using data for the ten cities with the largest number of mobility areas (Madrid, Barcelona, Sevilla, Valencia, Zaragoza, Malaga, Las Palmas de Gran Canaria, Cordoba, Bilbao, and Palma de Mallorca). I then test the models by leaving out one city at a time, refitting on the remaining nine, and predicting flows for the left-out city. I compare predictions using the Common Part of Commuters (CPC) metric frequently used in the literature, along with the root mean squared error (RMSE).
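The two comparison metrics can be sketched in base R as follows. The CPC formula matches the expression used in the summary table later in this document (twice the sum of the element-wise minimum of observed and predicted flows, divided by the sum of both totals). The function names and the toy flows are hypothetical.

```r
# Common Part of Commuters: 1 means perfect overlap, 0 means none.
cpc <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  2 * sum(pmin(observed, predicted)) / (sum(observed) + sum(predicted))
}

# Root mean squared error on the scale of the flows.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

obs  <- c(10, 20, 30)  # hypothetical observed flows
pred <- c(12, 18, 30)  # hypothetical predicted flows

cpc_value  <- cpc(obs, pred)
rmse_value <- rmse(obs, pred)
```

CPC rewards overlap in flow volume, while RMSE penalizes large errors on individual flows, so the two metrics can disagree for the same city.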

When mapped, are there substantive differences between the model predictions?

The RF models require vastly more time and computational resources than the simple gravity models. As such, any improvement should be substantive, not just statistical, for this method to be useful to, for example, policymakers in cities that lack ground-truth mobility data.

One way to check for substantive differences is to map the predictions of combined models, alongside the observed mobility in those cities.

Data

INE provides data on flows between mobility areas (roughly barri-sized, i.e. neighborhood-sized) for all Wednesdays and Sundays since the beginning of the pandemic. I have limited the data to September and October of 2021, as these were the months in which Covid case counts and restrictions made it most likely that mobility approximated “normality.” For the analysis described here, I have also limited the data to Wednesdays. As mentioned above, the data used for modeling includes only flows between areas within the municipalities of the ten cities. For each of INE’s mobility areas, I find the proportion of each of the land cover types within that area.
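A minimal sketch of the land cover proportion step, assuming the spatial overlay of CORINE polygons and mobility areas has already been done and yields one row per (area, land cover class) piece. The data frame, its column names, and the values are hypothetical; the class codes are only illustrative.

```r
# Hypothetical overlay result: area of each land cover class within each mobility area.
overlay <- data.frame(
  district  = c("A", "A", "A", "B", "B"),
  clc_class = c("111", "112", "141", "112", "121"),
  area_m2   = c(200, 600, 200, 300, 100)
)

# Sum area per district and class; classes absent from a district get 0.
area_wide <- xtabs(area_m2 ~ district + clc_class, data = overlay)

# Divide each row by its total so that each district's shares sum to 1.
land_cover_share <- prop.table(area_wide, margin = 1)
```

Joining these shares to the flow table once by origin and once by destination would give the doubled set of land cover variables described above.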

Models

Linear Gravity Model (LM)

Linear multilevel model of flows between mobility areas with the following independent variables: population of destination and origin, area of destination and origin, and distance between destination and origin, with city as the grouping level (random intercept). The flows and all numeric variables are on the log scale.

Modeled using lme4.
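Because the model is linear in logs, each coefficient acts as an exponent on the original scale. The sketch below illustrates this using the fixed-effect estimates copied from the summary(lm) output in the Analysis section. It ignores the city random intercept and any retransformation bias correction, both of which are simplifying assumptions, and the function name and input values (whose units are not stated in this document) are hypothetical.

```r
# Fixed-effect estimates copied from the summary(lm) output in the Analysis section.
beta <- c(
  intercept       = -3.567521,
  pob_destino     =  0.186539,
  area_destino    =  0.341906,
  pob_residencia  =  0.564808,
  area_residencia =  0.173799,
  dist            = -0.862350
)

# Predicted flow on the original scale (no random intercept, no bias correction).
gravity_flow <- function(pob_destino, area_destino, pob_residencia,
                         area_residencia, dist, b = beta) {
  log_flow <- b[["intercept"]] +
    b[["pob_destino"]]     * log(pob_destino) +
    b[["area_destino"]]    * log(area_destino) +
    b[["pob_residencia"]]  * log(pob_residencia) +
    b[["area_residencia"]] * log(area_residencia) +
    b[["dist"]]            * log(dist)
  exp(log_flow)
}

# Hypothetical pair of areas; only the distance differs between the two calls.
near <- gravity_flow(20000, 1.5e6, 15000, 1.2e6, dist = 1000)
far  <- gravity_flow(20000, 1.5e6, 15000, 1.2e6, dist = 2000)

# Doubling the distance multiplies the predicted flow by 2^beta_dist.
distance_ratio <- far / near
```

The ratio depends only on the distance coefficient, so it is unaffected by the hypothetical input values or their units.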

Random Forest (RF)

Random forest regression with 500 trees including all of the numeric variables in the LM, the 52 land cover variables, and dummy variables for the ten cities.

Modeled using randomForest.
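As a quick consistency check, the predictor count implied by this description can be compared with the "No. of variables tried at each split: 22" line in the randomForest output in the Analysis section. For regression, randomForest defaults to trying one third of the predictors (rounded down) at each split. The sketch assumes each city enters as one dummy column and that the training data contains no other predictors.

```r
n_gravity    <- 5   # two populations, two areas, and distance (from the LM)
n_land_cover <- 52  # origin and destination land cover shares
n_city       <- 10  # one dummy per city (assumption)

p <- n_gravity + n_land_cover + n_city

# Default mtry for regression forests: floor(p / 3), with a minimum of 1.
default_mtry <- max(floor(p / 3), 1)
```

Under these assumptions the default agrees with the value reported in the model output.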

Analysis

Can random forest regression with land cover variables improve predictions compared to a simple linear gravity model?

First, summary information about the two models:

rf
## 
## Call:
##  randomForest(formula = flujo ~ ., data = train.all, importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 22
## 
##           Mean of squared residuals: 740.4622
##                     % Var explained: 95.71
varImpPlot(rf, type = 1)

summary(lm)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## log(flujo) ~ log(pob_destino) + log(area_destino) + log(pob_residencia) +  
##     log(area_residencia) + log(dist) + (1 | city_destino)
##    Data: data
## 
## REML criterion at convergence: 251887
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.0675 -0.7084 -0.0054  0.7048  4.8286 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  city_destino (Intercept) 0.1712   0.4137  
##  Residual                 0.4184   0.6468  
## Number of obs: 128020, groups:  city_destino, 10
## 
## Fixed effects:
##                       Estimate Std. Error t value
## (Intercept)          -3.567521   0.153056  -23.31
## log(pob_destino)      0.186539   0.005465   34.13
## log(area_destino)     0.341906   0.001701  201.01
## log(pob_residencia)   0.564808   0.005646  100.04
## log(area_residencia)  0.173799   0.001775   97.92
## log(dist)            -0.862350   0.003114 -276.94
## 
## Correlation of Fixed Effects:
##             (Intr) lg(pb_d) lg(r_d) lg(pb_r) lg(r_r)
## lg(pb_dstn) -0.330                                  
## log(r_dstn) -0.083 -0.179                           
## lg(pb_rsdn) -0.343  0.013    0.063                  
## lg(r_rsdnc) -0.081 -0.004    0.221  -0.163          
## log(dist)    0.015  0.001   -0.505  -0.064   -0.420
tibble(
  model = c("RF", "LM"),
  rmse = c(sqrt(mean(data$errors_rf^2)), sqrt(mean(data$errors_lm^2))),
  cpc = c(
    sum(2 * data$min_rf) / (sum(data$flujo) + sum(data$flujo_pred_rf)),
    sum(2 * data$min_lm) / (sum(data$flujo) + sum(data$flujo_pred_lm))
  )
)
## # A tibble: 2 × 3
##   model  rmse   cpc
##   <chr> <dbl> <dbl>
## 1 RF     24.1 0.938
## 2 LM     98.7 0.719

We can see that the RF model (RMSE: 24.1, CPC: .94) outperforms the gravity model (RMSE: 98.7, CPC: .72) by a wide margin when modeling the full data. In the variable importance plot, we can see that the gravity model variables (population, area, distance) are all among the most important, though the area variables and destination population are outranked by several of the land cover variables. This indicates, unsurprisingly, that human mobility in cities is influenced by the character of neighborhoods, not just their density.

Another interesting note about the variable importance plot is that origin variables ("_residencia") appear to be more important than destination ones ("_destino"). The table below summarizes:

## # A tibble: 4 × 2
##   variables                pct_inc_mse
##   <chr>                          <dbl>
## 1 Origin (All)                    35.3
## 2 Destination (All)               21.8
## 3 Origin (Land Cover)             32.5
## 4 Destination (Land Cover)        20.1

The measure “pct_inc_mse” is the percent increase in out-of-bag mean squared error when the values of a given variable are randomly permuted, so larger values indicate variables the model relies on more. We can see that the origin variables are indeed better predictors, on average, than the destination variables. This suggests that human mobility in these cities is more dependent on “push” factors than “pull” ones.
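A minimal sketch of how importance scores could be averaged by origin and destination, relying on the "_residencia" and "_destino" suffixes. The importance values and the land cover variable names below are hypothetical; with a fitted model, the scores would come from something like importance(rf, type = 1).

```r
# Hypothetical importance scores, named like the model variables.
imp <- c(
  pob_residencia     = 40,
  pob_destino        = 20,
  clc_112_residencia = 30,
  clc_112_destino    = 22,
  dist               = 60
)

# Classify each variable by suffix; variables with neither suffix become NA.
group <- ifelse(grepl("_residencia$", names(imp)), "Origin",
         ifelse(grepl("_destino$", names(imp)), "Destination", NA))

# Mean importance per group; tapply drops the NA group (here, dist).
mean_importance <- tapply(imp, group, mean)
```

Restricting imp to the land cover variables before grouping would give the "(Land Cover)" rows of the table.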

Which model provides the best predictions for new cities (and why)?

The table below reports the RMSE and CPC of predictions for each of the ten cities. Each city in turn was treated as the “test” data while the other nine served as the “train” data. The Madrid, Bilbao, and Malaga variables ranked highly in importance above, which indicates that those cities are particularly distinct in their mobility patterns, so I would expect the models to struggle with them. Also, RF cannot extrapolate beyond the range of values seen in training, so I expect the linear model may do better with the larger cities.

## # A tibble: 10 × 7
##    city                     rmse_rf rmse_lm better_rmse cpc_rf cpc_lm better_cpc
##    <chr>                      <dbl>   <dbl> <chr>        <dbl>  <dbl> <chr>     
##  1 Zaragoza                   114.    113.  LM           0.728  0.732 LM        
##  2 Valencia                   103.    116.  RanFor       0.737  0.710 RanFor    
##  3 Barcelona                   86.3    67.3 LM           0.699  0.742 LM        
##  4 Sevilla                    108.    114.  RanFor       0.720  0.710 RanFor    
##  5 Cordoba                    114.    110.  LM           0.694  0.689 RanFor    
##  6 Las Palmas de Gran Cana…   138.    145.  RanFor       0.681  0.640 RanFor    
##  7 Bilbao                     281.    306.  RanFor       0.678  0.623 RanFor    
##  8 Malaga                     254.    276.  RanFor       0.676  0.623 RanFor    
##  9 Palma de Mallorca          260.    265.  RanFor       0.679  0.613 RanFor    
## 10 Madrid                     112.     75.1 LM           0.585  0.688 LM
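The leave-one-city-out procedure behind the table above can be sketched as follows. This is an illustration only: it uses synthetic data, three hypothetical cities, and a plain lm() gravity model as a stand-in for both the multilevel model and the random forest.

```r
set.seed(1)
cities <- c("A", "B", "C")  # hypothetical city labels

# Synthetic flows generated from a simple gravity-style relationship.
flows <- data.frame(
  city = rep(cities, each = 50),
  pob  = runif(150, 1e3, 5e4),
  dist = runif(150, 500, 1e4)
)
flows$flujo <- exp(1 + 0.5 * log(flows$pob) - 0.8 * log(flows$dist) +
                     rnorm(150, sd = 0.3))

# Hold out each city in turn, fit on the rest, and score the held-out city.
loco <- do.call(rbind, lapply(cities, function(test_city) {
  train <- flows[flows$city != test_city, ]
  test  <- flows[flows$city == test_city, ]
  fit   <- lm(log(flujo) ~ log(pob) + log(dist), data = train)
  pred  <- exp(predict(fit, newdata = test))  # back to the flow scale
  data.frame(
    city = test_city,
    rmse = sqrt(mean((test$flujo - pred)^2)),
    cpc  = 2 * sum(pmin(test$flujo, pred)) / (sum(test$flujo) + sum(pred))
  )
}))
```

Note that a model containing city dummies or city-level intercepts has no fitted effect for the held-out city, so how that term is handled at prediction time is a design choice this sketch sidesteps.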

Considering the difference in data used and computing time required, the random forest predictions are disappointing. The best-known disadvantage of RF is that it cannot extrapolate to values outside the range of its training data. Perhaps the cities where LM prevails are those that have more variables with maxima or minima outside the range of the training set.

## # A tibble: 10 × 5
##    city                       oos_vars oos_var_importance better_rmse better_cpc
##    <chr>                         <dbl>              <dbl> <chr>       <chr>     
##  1 Madrid                           16              579.  LM          LM        
##  2 Barcelona                        18              554.  LM          LM        
##  3 Palma de Mallorca                 5              314.  RanFor      RanFor    
##  4 Zaragoza                          6              283.  LM          LM        
##  5 Cordoba                           6              269.  LM          RanFor    
##  6 Sevilla                           8              226.  RanFor      RanFor    
##  7 Malaga                            4              118.  RanFor      RanFor    
##  8 Las Palmas de Gran Canaria        2               29.1 RanFor      RanFor    
##  9 Valencia                          2               22.9 RanFor      RanFor    
## 10 Bilbao                            0                0   RanFor      RanFor

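The accuracy of the random forest predictions relative to LM appears to decline as the importance of variables with out-of-sample values in the test data (oos_var_importance) rises. LM wins in Madrid and Barcelona, where that importance is highest, while RF wins on both metrics in the five cities where it is lowest. Palma de Mallorca, where RF wins despite the third-highest value, is the main exception.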
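The out-of-sample check summarized in the table above can be sketched as follows: a variable counts as out-of-sample when its range in the test city extends beyond its range in the training cities. The function name, the toy data, the importance values, and the choice to sum importance scores over the flagged variables are all assumptions.

```r
# Names of variables whose test range falls outside the training range.
oos_variables <- function(train, test) {
  vars <- intersect(names(train), names(test))
  is_oos <- vapply(vars, function(v) {
    max(test[[v]]) > max(train[[v]]) || min(test[[v]]) < min(train[[v]])
  }, logical(1))
  vars[is_oos]
}

# Hypothetical training and test data with two predictors.
train <- data.frame(pob = c(100, 500), green_share = c(0.1, 0.4))
test  <- data.frame(pob = c(200, 900), green_share = c(0.2, 0.3))

oos <- oos_variables(train, test)

# Hypothetical importance scores, summed over the out-of-sample variables.
importance_scores  <- c(pob = 40, green_share = 15)
oos_var_importance <- sum(importance_scores[oos])
```

Here only pob exceeds its training range, so it is the only variable counted.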

When mapped, are there substantive differences between the model predictions?

The cities below have been selected based on the table in the previous section in order to illustrate how the random forest regression model performs, compared to the gravity model, in various out-of-sample scenarios.

For Barcelona, the gravity model, which performs better, does a remarkable job considering how little data it uses and how much less computing time it takes than the random forest model. The RF model, on the other hand, seems to roughly capture the pattern of mobility in Barcelona but overshoots the amount of movement.

For Sevilla, the RF model makes modestly better predictions (RMSE: 108 vs. 114, CPC: .720 vs. .710). Visually, the RF model better captures that certain central districts of the city are hubs of mobility. The LM assumes that the two southernmost districts, which are large and close to each other, will have large flows between them. In reality, flows from those districts to the center of the city are heavier, and the RF model does a better job capturing this.

Based on the analysis of out-of-sample values above, Bilbao is the ideal city for which to make predictions using RF with this data. The CPCs and RMSEs confirm that the RF outperforms the LM statistically, and the difference is evident when mapped (if you look closely).

What About Sundays?

Below are the same models and analysis (with models for several test cities still pending), using data for Sundays instead of Wednesdays.

## 
## Call:
##  randomForest(formula = flujo ~ ., data = train.all, importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 22
## 
##           Mean of squared residuals: 719.1553
##                     % Var explained: 94.57

## # A tibble: 2 × 3
##   model  rmse   cpc
##   <chr> <dbl> <dbl>
## 1 RF     23.7 0.928
## 2 LM     89.3 0.722
## # A tibble: 4 × 2
##   variables                pct_inc_mse
##   <chr>                          <dbl>
## 1 Origin (All)                    32.6
## 2 Destination (All)               25.4
## 3 Origin (Land Cover)             29.9
## 4 Destination (Land Cover)        23.9
## # A tibble: 10 × 7
##    city                     rmse_rf rmse_lm better_rmse cpc_rf cpc_lm better_cpc
##    <chr>                      <dbl>   <dbl> <chr>        <dbl>  <dbl> <chr>     
##  1 Valencia                    77.5    86.3 RanFor       0.769  0.730 RanFor    
##  2 Barcelona                   70.0    63.7 LM           0.739  0.746 LM        
##  3 Zaragoza                    88.2    86.0 LM           0.725  0.738 LM        
##  4 Las Palmas de Gran Cana…    77.2    76.5 LM           0.740  0.721 RanFor    
##  5 Madrid                      75.4    72.1 LM           0.719  0.737 LM        
##  6 Sevilla                     96.0   103.  RanFor       0.721  0.700 RanFor    
##  7 Palma de Mallorca          120.    127.  RanFor       0.721  0.678 RanFor    
##  8 Cordoba                    102.     91.2 LM           0.685  0.713 LM        
##  9 Bilbao                     165.    182.  RanFor       0.710  0.661 RanFor    
## 10 Malaga                     170.    178.  RanFor       0.666  0.639 RanFor
## # A tibble: 10 × 5
##    city                       oos_vars oos_var_importance better_rmse better_cpc
##    <chr>                         <dbl>              <dbl> <chr>       <chr>     
##  1 Madrid                           15              605.  LM          LM        
##  2 Barcelona                        18              584.  LM          LM        
##  3 Palma de Mallorca                 5              348.  RanFor      RanFor    
##  4 Cordoba                           6              272.  LM          LM        
##  5 Zaragoza                          6              271.  LM          LM        
##  6 Sevilla                           8              206.  RanFor      RanFor    
##  7 Malaga                            5              124.  RanFor      RanFor    
##  8 Las Palmas de Gran Canaria        2               30.0 LM          RanFor    
##  9 Valencia                          2               28.7 RanFor      RanFor    
## 10 Bilbao                            0                0   RanFor      RanFor

Findings, In Brief

CPC for each test city (higher is better)